Persistent RNNs: Stashing Recurrent Weights On-Chip

Authors

  • Greg Diamos
  • Shubho Sengupta
  • Bryan Catanzaro
  • Mike Chrzanowski
  • Adam Coates
  • Erich Elsen
  • Jesse Engel
  • Awni Hannun
  • Sanjeev Satheesh
Abstract

This paper introduces a framework for mapping Recurrent Neural Network (RNN) architectures efficiently onto parallel processors such as GPUs. Key to our approach is the use of persistent computational kernels that exploit the processor’s memory hierarchy to reuse network weights over multiple timesteps. Using our framework, we show how it is possible to achieve substantially higher computational throughput at lower mini-batch sizes than direct implementations of RNNs based on matrix multiplications. Our initial implementation achieves 2.8 TFLOP/s at a mini-batch size of 4 on an NVIDIA TitanX GPU, which is about 45% of theoretical peak throughput, and is 30X faster than a standard RNN implementation based on optimized GEMM kernels at this batch size. Reducing the batch size from 64 to 4 per processor provides a 16x reduction in activation memory footprint, enables strong scaling to 16x more GPUs using data-parallelism, and allows us to efficiently explore end-to-end speech recognition models with up to 108 residual RNN layers.
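The core idea above is that the recurrent weight matrix can be loaded into on-chip storage (registers and shared memory) once and then reused across every timestep, rather than being re-read from DRAM by a separate GEMM call per timestep. Below is a minimal, single-block CUDA sketch of that idea only; the HIDDEN/BATCH/STEPS sizes, the ReLU recurrence, and the use of __syncthreads() in place of the grid-wide barrier a full-GPU persistent kernel needs are illustrative assumptions, not the paper's actual kernel.

// Sketch of a "persistent" RNN kernel: each thread stashes one row of the
// recurrent weight matrix in registers and reuses it for every timestep,
// so the weights are read from off-chip memory only once per launch.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

constexpr int HIDDEN = 128;   // one thread per hidden unit (single block)
constexpr int BATCH  = 4;     // small mini-batch, as in the paper's setting
constexpr int STEPS  = 64;    // timesteps processed inside one kernel launch

__global__ void persistent_rnn(const float* __restrict__ W,   // [HIDDEN x HIDDEN] recurrent weights
                               const float* __restrict__ x,   // [STEPS x BATCH x HIDDEN] input projections
                               float* __restrict__ h)         // [BATCH x HIDDEN] hidden state (in/out)
{
    const int row = threadIdx.x;              // this thread owns one output unit

    // Stash this thread's row of the recurrent weight matrix in registers.
    float w[HIDDEN];
    for (int k = 0; k < HIDDEN; ++k)
        w[k] = W[row * HIDDEN + k];

    // Hidden state for the whole mini-batch lives in shared memory.
    __shared__ float hs[BATCH][HIDDEN];
    for (int b = 0; b < BATCH; ++b)
        hs[b][row] = h[b * HIDDEN + row];
    __syncthreads();

    // Loop over timesteps inside the kernel: the weights are reused from
    // registers every step instead of being reloaded by a per-step GEMM.
    for (int t = 0; t < STEPS; ++t) {
        float out[BATCH];
        for (int b = 0; b < BATCH; ++b) {
            float acc = x[(t * BATCH + b) * HIDDEN + row];
            for (int k = 0; k < HIDDEN; ++k)
                acc += w[k] * hs[b][k];
            out[b] = acc > 0.f ? acc : 0.f;   // ReLU recurrence
        }
        __syncthreads();                       // all threads have read h(t-1)
        for (int b = 0; b < BATCH; ++b)
            hs[b][row] = out[b];
        __syncthreads();                       // h(t) is now visible to all
    }

    for (int b = 0; b < BATCH; ++b)
        h[b * HIDDEN + row] = hs[b][row];
}

int main() {
    std::vector<float> W(HIDDEN * HIDDEN, 0.001f), x(STEPS * BATCH * HIDDEN, 0.1f), h(BATCH * HIDDEN, 0.f);
    float *dW, *dx, *dh;
    cudaMalloc(&dW, W.size() * sizeof(float));
    cudaMalloc(&dx, x.size() * sizeof(float));
    cudaMalloc(&dh, h.size() * sizeof(float));
    cudaMemcpy(dW, W.data(), W.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x.data(), x.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dh, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);

    persistent_rnn<<<1, HIDDEN>>>(dW, dx, dh);  // weights stay on-chip for all STEPS
    cudaMemcpy(h.data(), dh, h.size() * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0][0] after %d steps: %f\n", STEPS, h[0]);

    cudaFree(dW); cudaFree(dx); cudaFree(dh);
    return 0;
}

In the full-scale version described in the abstract, the weights are partitioned across the register files of all of the GPU's streaming multiprocessors and the timestep loop is synchronized across the entire grid rather than a single block, which is what makes small mini-batches (such as 4) efficient.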

Similar articles

Persistent RNNs: Stashing Recurrent Weights On-Chip

This paper introduces a new technique for mapping Deep Recurrent Neural Networks (RNN) efficiently onto GPUs. We show how it is possible to achieve substantially higher computational throughput at low mini-batch sizes than direct implementations of RNNs based on matrix multiplications. The key to our approach is the use of persistent computational kernels that exploit the GPU’s inverted memory ...

DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks

Recently, deep learning with convolutional neural networks (CNNs) and recurrent neural networks (RNNs) has become universal in all-around applications. CNNs are used to support vision recognition and processing, and RNNs are able to recognize time varying entities and to support generative models. Also, combining both CNNs and RNNs can recognize time varying visual entities, such as action and ...

Sparse Persistent RNNs: Squeezing Large Recurrent Networks On-Chip

Recurrent Neural Networks (RNNs) are powerful tools for solving sequence-based problems, but their efficacy and execution time are dependent on the size of the network. Following recent work in simplifying these networks with model pruning and a novel mapping of work onto GPUs, we design an efficient implementation for sparse RNNs. We investigate several optimizations and tradeoffs: Lamport tim...

Block-Sparse Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are used in state-of-the-art models in domains such as speech recognition, machine translation, and language modelling. Sparsity is a technique to reduce compute and memory requirements of deep learning models. Sparse RNNs are easier to deploy on devices and high-end server processors. Even though sparse operations need less compute and memory relative to their ...

Fast Weight Long Short-Term Memory

Associative memory using fast weights is a short-term memory mechanism that substantially improves the memory capacity and time scale of recurrent neural networks (RNNs). As recent studies introduced fast weights only to regular RNNs, it is unknown whether fast weight memory is beneficial to gated RNNs. In this work, we report a significant synergy between long short-term memory (LSTM) networks...


Publication date: 2016